Citibike Data Case Study¶

Background: Citi Bike is New York City’s bike share system established in 2013. It has become a vital part of the transportation network in New York City. In this case study, we use citibike trip sample data from 2013-07 to 2016-08 to complete three tasks below:

  • Use Exploratory data analysis to summarize the Citibike usage statistics and find out of potential outliers or anomalies
  • Forecast the Citibike trips per day based on the historic usage data
  • Provide insights into the usage patterns of citibike stations
In [1]:
!pip install prophet
Requirement already satisfied: prophet in /usr/local/lib/python3.10/dist-packages (1.1.6)
Requirement already satisfied: cmdstanpy>=1.0.4 in /usr/local/lib/python3.10/dist-packages (from prophet) (1.2.4)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.10/dist-packages (from prophet) (1.26.4)
Requirement already satisfied: matplotlib>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from prophet) (3.8.0)
Requirement already satisfied: pandas>=1.0.4 in /usr/local/lib/python3.10/dist-packages (from prophet) (2.2.2)
Requirement already satisfied: holidays<1,>=0.25 in /usr/local/lib/python3.10/dist-packages (from prophet) (0.61)
Requirement already satisfied: tqdm>=4.36.1 in /usr/local/lib/python3.10/dist-packages (from prophet) (4.66.6)
Requirement already satisfied: importlib-resources in /usr/local/lib/python3.10/dist-packages (from prophet) (6.4.5)
Requirement already satisfied: stanio<2.0.0,>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from cmdstanpy>=1.0.4->prophet) (0.5.1)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.10/dist-packages (from holidays<1,>=0.25->prophet) (2.8.2)
Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.0.0->prophet) (1.3.1)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.0.0->prophet) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.0.0->prophet) (4.55.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.0.0->prophet) (1.4.7)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.0.0->prophet) (24.2)
Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.0.0->prophet) (11.0.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib>=2.0.0->prophet) (3.2.0)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.4->prophet) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.0.4->prophet) (2024.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil->holidays<1,>=0.25->prophet) (1.16.0)
In [27]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import warnings
import folium
from folium.plugins import HeatMap, HeatMapWithTime
from IPython.display import display, HTML
from scipy.stats import percentileofscore
from sklearn.metrics import mean_absolute_error, mean_squared_error
from prophet import Prophet
import plotly.express as px
# import requests

# Ignore all warnings
warnings.filterwarnings('ignore')
# Adjust display options for pandas
pd.set_option('display.max_columns', 100)  # Show all columns
# Setting a professional style
sns.set_theme(style="dark")
In [3]:
from google.colab import output
output.enable_custom_widget_manager()
# output.disable_custom_widget_manager()

1. EDA¶

1.1 Import Data¶

In [4]:
def describe_data(df):
    summary = {}

    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):  # Continuous variables
            summary[col] = {
                'Type': 'Continuous',
                'Count': df[col].count(),
                'Mean': df[col].mean(),
                'Median': df[col].median(),
                'Std': df[col].std(),
                'Min': df[col].min(),
                '25%': df[col].quantile(0.25),
                '50%': df[col].quantile(0.50),
                '75%': df[col].quantile(0.75),
                'Max': df[col].max(),
            }
        elif pd.api.types.is_categorical_dtype(df[col]) or df[col].dtype == object:  # Categorical variables
            summary[col] = {
                'Type': 'Categorical',
                'Count': df[col].count(),
                'Unique': df[col].nunique(),
                'Top': df[col].mode()[0] if not df[col].mode().empty else None,
                'Freq': df[col].value_counts().iloc[0] if not df[col].value_counts().empty else None
            }

    return pd.DataFrame(summary).T

1.1.1 Load citibike trip sample¶

In [5]:
citibike_trip_sample_df = pd.read_csv('data/citibike-trips-sample.csv')
citibike_trip_sample_df['starttime'] = pd.to_datetime(citibike_trip_sample_df['starttime'])
citibike_trip_sample_df['start_hour'] = citibike_trip_sample_df['starttime'].dt.hour
citibike_trip_sample_df['start_month'] = citibike_trip_sample_df['starttime'].dt.month
citibike_trip_sample_df['start_year'] = citibike_trip_sample_df['starttime'].dt.year

citibike_trip_sample_df['stoptime'] = pd.to_datetime(citibike_trip_sample_df['stoptime'])
citibike_trip_sample_df['stop_hour'] = citibike_trip_sample_df['stoptime'].dt.hour
citibike_trip_sample_df['stop_month'] = citibike_trip_sample_df['stoptime'].dt.month
citibike_trip_sample_df['stop_year'] = citibike_trip_sample_df['stoptime'].dt.year
citibike_trip_sample_df[['start_station_id', 'end_station_id']] = citibike_trip_sample_df[['start_station_id', 'end_station_id']].astype(str)
citibike_trip_sample_df['user_age'] = citibike_trip_sample_df['start_year'] - citibike_trip_sample_df['birth_year']
citibike_trip_sample_df.shape
Out[5]:
(315785, 23)
In [6]:
# Check the time range of the trip sample data
print("Start Time Range: {} to {}".format(citibike_trip_sample_df.starttime.min(), citibike_trip_sample_df.starttime.max()))
print("Stop Time Range: {} to {}".format(citibike_trip_sample_df.stoptime.min(), citibike_trip_sample_df.stoptime.max()))
Start Time Range: 2013-07-01 00:02:00 to 2016-08-30 23:33:00
Stop Time Range: 2013-07-01 00:12:00 to 2016-08-30 23:50:00
In [7]:
citibike_trip_sample_df.head()
Out[7]:
tripduration starttime stoptime start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bikeid usertype birth_year gender customer_plan start_hour start_month start_year stop_hour stop_month stop_year user_age
0 2319 2016-03-09 13:08:00 2016-03-09 13:47:00 520 W 52 St & 5 Ave 40.759923 -73.976485 363 West Thames St 40.708347 -74.017134 23062 Subscriber 1972.0 male NaN 13 3 2016 13 3 2016 44.0
1 313 2015-07-09 15:42:00 2015-07-09 15:47:00 520 W 52 St & 5 Ave 40.759923 -73.976485 493 W 45 St & 6 Ave 40.756800 -73.982912 16909 Subscriber 1968.0 female NaN 15 7 2015 15 7 2015 47.0
2 906 2016-01-11 18:32:00 2016-01-11 18:47:00 520 W 52 St & 5 Ave 40.759923 -73.976485 3162 W 78 St & Broadway 40.783400 -73.980931 15614 Subscriber 1961.0 male NaN 18 1 2016 18 1 2016 55.0
3 716 2013-10-30 11:53:00 2013-10-30 12:05:00 520 W 52 St & 5 Ave 40.759923 -73.976485 533 Broadway & W 39 St 40.752996 -73.987216 19280 Subscriber 1954.0 male NaN 11 10 2013 12 10 2013 59.0
4 312 2014-06-04 16:12:00 2014-06-04 16:17:00 520 W 52 St & 5 Ave 40.759923 -73.976485 519 E 42 St & Vanderbilt Ave 40.752416 -73.978370 16483 Subscriber 1963.0 male NaN 16 6 2014 16 6 2014 51.0
In [8]:
describe_data(citibike_trip_sample_df)
Out[8]:
Type Count Mean Median Std Min 25% 50% 75% Max Unique Top Freq
tripduration Continuous 315785 938.498763 631.0 6534.949983 60 391.0 631.0 1051.0 1733173 NaN NaN NaN
start_station_id Categorical 315785 NaN NaN NaN NaN NaN NaN NaN NaN 575 519 3374
start_station_name Categorical 315785 NaN NaN NaN NaN NaN NaN NaN NaN 599 8 Ave & W 31 St 2812
start_station_latitude Continuous 315785 40.735735 40.736529 0.020822 40.646678 40.721101 40.736529 40.750977 40.804213 NaN NaN NaN
start_station_longitude Continuous 315785 -73.989392 -73.990093 0.014088 -74.017134 -73.999318 -73.990093 -73.981281 -73.929891 NaN NaN NaN
end_station_id Categorical 315785 NaN NaN NaN NaN NaN NaN NaN NaN 584 519 3028
end_station_name Categorical 315785 NaN NaN NaN NaN NaN NaN NaN NaN 607 E 17 St & Broadway 3012
end_station_latitude Continuous 315785 40.735366 40.736502 0.075412 0.0 40.720874 40.736502 40.750664 40.804213 NaN NaN NaN
end_station_longitude Continuous 315785 -73.989326 -73.990214 0.13242 -74.017134 -73.999744 -73.990214 -73.981346 0.0 NaN NaN NaN
bikeid Continuous 315785 19074.89687 18779.0 2888.180823 14529 16687.0 18779.0 21090.0 26884 NaN NaN NaN
usertype Categorical 315785 NaN NaN NaN NaN NaN NaN NaN NaN 2 Subscriber 277448
birth_year Continuous 276894 1976.528812 1979.0 11.449689 1885.0 1969.0 1979.0 1986.0 2000.0 NaN NaN NaN
gender Categorical 315785 NaN NaN NaN NaN NaN NaN NaN NaN 3 male 211692
customer_plan Continuous 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
start_hour Continuous 315785 13.969216 15.0 4.881489 0 10.0 15.0 18.0 23 NaN NaN NaN
start_month Continuous 315785 7.100112 7.0 2.80401 1 5.0 7.0 9.0 12 NaN NaN NaN
start_year Continuous 315785 2014.693956 2015.0 1.036277 2013 2014.0 2015.0 2016.0 2016 NaN NaN NaN
stop_hour Continuous 315785 14.129157 15.0 4.927257 0 10.0 15.0 18.0 23 NaN NaN NaN
stop_month Continuous 315785 7.100157 7.0 2.804024 1 5.0 7.0 9.0 12 NaN NaN NaN
stop_year Continuous 315785 2014.693966 2015.0 1.036278 2013 2014.0 2015.0 2016.0 2016 NaN NaN NaN
user_age Continuous 276894 38.160982 36.0 11.439389 16.0 29.0 36.0 46.0 131.0 NaN NaN NaN

1.1.1 Load citibike stations data¶

In [9]:
citibike_station_df = pd.read_csv('data/citibike-stations.csv')
citibike_station_df[['station_id_int', 'region_id', 'eightd_has_key_dispenser', 'is_installed', 'is_renting', 'is_returning', 'eightd_has_available_keys']] = citibike_station_df[['station_id_int', 'region_id', 'eightd_has_key_dispenser', 'is_installed', 'is_renting', 'is_returning', 'eightd_has_available_keys']].astype(str)
citibike_station_df.shape
Out[9]:
(1802, 17)
In [10]:
citibike_station_df.head()
Out[10]:
station_id_int name short_name latitude longitude region_id rental_methods capacity eightd_has_key_dispenser num_bikes_available num_bikes_disabled num_docks_available num_docks_disabled is_installed is_renting is_returning eightd_has_available_keys
0 3664 North Moore St & Greenwich St 5470.12 40.720195 -74.010301 71 KEY, CREDITCARD 29 False 2 1 0 29 True True True False
1 4682 W 34 St & Hudson Blvd E 6535.04 40.755167 -74.000599 71 KEY, CREDITCARD 57 False 2 1 52 2 True True True False
2 3233 E 48 St & 5 Ave 6626.01 40.757246 -73.978059 71 KEY, CREDITCARD 62 False 2 1 43 14 True True True False
3 244 Willoughby Ave & Hall St 4611.03 40.691960 -73.965369 71 KEY, CREDITCARD 51 False 12 1 32 6 True True True False
4 525 W 34 St & 11 Ave 6578.01 40.755942 -74.002116 71 KEY, CREDITCARD 80 False 35 1 22 19 True True True False
In [11]:
# citibike_station_df.region_id.value_counts()
In [12]:
describe_data(citibike_station_df)
Out[12]:
Type Count Unique Top Freq Mean Median Std Min 25% 50% 75% Max
station_id_int Categorical 1802 1802 116 1 NaN NaN NaN NaN NaN NaN NaN NaN
name Categorical 1802 1802 1 Ave & E 110 St 1 NaN NaN NaN NaN NaN NaN NaN NaN
short_name Categorical 1802 1802 2733.03 1 NaN NaN NaN NaN NaN NaN NaN NaN
latitude Continuous 1802 NaN NaN NaN 40.723477 40.740629 0.961808 0.0 40.696093 40.740629 40.793693 40.88226
longitude Continuous 1802 NaN NaN NaN -73.909348 -73.947438 1.742434 -74.03281 -73.982605 -73.947438 -73.918636 0.0
region_id Categorical 1802 3 71 1800 NaN NaN NaN NaN NaN NaN NaN NaN
rental_methods Categorical 1802 1 KEY, CREDITCARD 1802 NaN NaN NaN NaN NaN NaN NaN NaN
capacity Continuous 1802 NaN NaN NaN 30.95616 25.0 17.046749 0 21.0 25.0 37.0 123
eightd_has_key_dispenser Categorical 1802 1 False 1802 NaN NaN NaN NaN NaN NaN NaN NaN
num_bikes_available Continuous 1802 NaN NaN NaN 11.933962 7.0 13.884131 0 2.0 7.0 18.0 80
num_bikes_disabled Continuous 1802 NaN NaN NaN 1.468923 1.0 1.718818 0 0.0 1.0 2.0 19
num_docks_available Continuous 1802 NaN NaN NaN 16.583241 16.0 13.525389 0 7.0 16.0 21.0 105
num_docks_disabled Continuous 1802 NaN NaN NaN 0.082686 0.0 1.063941 0 0.0 0.0 0.0 29
is_installed Categorical 1802 2 True 1737 NaN NaN NaN NaN NaN NaN NaN NaN
is_renting Categorical 1802 2 True 1736 NaN NaN NaN NaN NaN NaN NaN NaN
is_returning Categorical 1802 2 True 1736 NaN NaN NaN NaN NaN NaN NaN NaN
eightd_has_available_keys Categorical 1802 1 False 1802 NaN NaN NaN NaN NaN NaN NaN NaN

1.2 Explore Data Distributions and Identify Potential Outliers/Anomalies¶

1.2.1 Start Time and Stop Time¶

In [13]:
# Find out start time is later than stop time
# citibike_trip_sample_df.query('starttime>=stoptime').shape
citibike_trip_sample_df.query('starttime>=stoptime')
Out[13]:
tripduration starttime stoptime start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bikeid usertype birth_year gender customer_plan start_hour start_month start_year stop_hour stop_month stop_year user_age
18683 633 2015-11-01 01:50:00 2015-11-01 01:01:00 401 Allen St & Rivington St 40.720196 -73.989978 527 E 33 St & 2 Ave 40.744023 -73.976056 22787 Subscriber 1988.0 male NaN 1 11 2015 1 11 2015 27.0
  • There is one record where stoptime is earlier than startime. This is a clear anomalous data point that might be caused by entry error.

1.2.2 User Age¶

In [14]:
plt.figure(figsize=(10, 6))
sns.histplot(citibike_trip_sample_df[citibike_trip_sample_df['user_age'] <
                                     citibike_trip_sample_df['user_age'].quantile(0.999)],
             x='user_age', kde=True, color='#1f7ced', bins=30)
plt.title('Distribution of User Age (Zoomed-In)', fontsize=16)
plt.xlabel('User Age', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Save the plot as a PNG file
plt.savefig('image/distribution_of_user_age_(zoomed-in).png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image
In [15]:
citibike_trip_sample_df['user_age'].quantile(0.99), citibike_trip_sample_df['user_age'].quantile(0.999)
Out[15]:
(67.0, 76.10700000001816)
In [16]:
print(citibike_trip_sample_df.query('user_age>=76').shape)
citibike_trip_sample_df.query('user_age>=76').head()
(339, 23)
Out[16]:
tripduration starttime stoptime start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bikeid usertype birth_year gender customer_plan start_hour start_month start_year stop_hour stop_month stop_year user_age
1636 1305 2014-10-18 08:25:00 2014-10-18 08:47:00 349 Rivington St & Ridge St 40.718502 -73.983299 467 Dean St & 4 Ave 40.683125 -73.978951 17955 Subscriber 1910.0 male NaN 8 10 2014 8 10 2014 104.0
3121 286 2016-07-28 18:41:00 2016-07-28 18:46:00 395 Bond St & Schermerhorn St 40.688070 -73.984106 3246 Montague St & Clinton St 40.694281 -73.992300 25540 Subscriber 1940.0 male NaN 18 7 2016 18 7 2016 76.0
3249 1411 2016-02-22 09:02:00 2016-02-22 09:25:00 446 W 24 St & 7 Ave 40.744876 -73.995299 475 E 16 St & Irving Pl 40.735243 -73.987586 22987 Subscriber 1940.0 female NaN 9 2 2016 9 2 2016 76.0
5975 976 2016-01-11 16:29:00 2016-01-11 16:45:00 3141 1 Ave & E 68 St 40.765005 -73.958185 3171 Amsterdam Ave & W 82 St 40.785247 -73.976673 20933 Subscriber 1939.0 male NaN 16 1 2016 16 1 2016 77.0
7346 2327 2014-11-10 13:37:00 2014-11-10 14:16:00 408 Market St & Cherry St 40.710762 -73.994004 302 Avenue D & E 3 St 40.720828 -73.977932 19900 Subscriber 1936.0 male NaN 13 11 2014 14 11 2014 78.0
  • The user age range that is below 99.9th percentile is between 16 and 76 years old. From statistical perspectives, any person whose age is above 76 years is considered as outliers and potential anomalies. But based on our experience, people in their 80s or even 90s, might still have the ability to ride bikes. Although it is not highly recommended. However, we do see some records in the table show the ages that are above 100. These are most likely the anomalies.
  • Based on the distribution of user age, most of the citibike users are between 20 and 60 years old. This is aligh with our experience that most users are daily commuters in the city.
  • Potential TODO Actions: If we do see significant increase of senior people using the citibike, more measures need to be taken to ensure their safety while riding。 Additionally, the user manual should be made more user-friendly for the seniors.

1.2.3 Gender and User Type Distribution¶

In [17]:
print(citibike_trip_sample_df.gender.value_counts())
print('----------------')
print(citibike_trip_sample_df.usertype.value_counts())
gender
male       211692
female      64887
unknown     39206
Name: count, dtype: int64
----------------
usertype
Subscriber    277448
Customer       38337
Name: count, dtype: int64
In [18]:
# Bar chart for categorical variables: gender and usertype
fig, axes = plt.subplots(1, 2, figsize=(14, 6))  # Two charts side by side

# Bar chart for Gender
sns.countplot(data=citibike_trip_sample_df, x='gender', ax=axes[0], palette='Blues')
axes[0].set_title('Distribution of Gender', fontsize=14)
axes[0].set_xlabel('Gender', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)

# Bar chart for Usertype
sns.countplot(data=citibike_trip_sample_df, x='usertype', ax=axes[1], palette='Blues')
axes[1].set_title('Distribution of Usertype', fontsize=14)
axes[1].set_xlabel('Usertype', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)

plt.tight_layout()

# Save the figure as a PNG file
plt.savefig('image/distribution_of_usertype_and_gender_bar_charts.png', dpi=300, bbox_inches='tight')
No description has been provided for this image
  • The number of male users significantly exceeds that of female users

    • The number of users from both genders should typically be similar. This deviation from the norm is unusual. If the data is accurate, it suggests there may be factors discouraging female users from utilizing Citibike.
    • There is one of the articles explained the potential reasons why females don't like riding Citi Bikes. "Why Aren’t More Women Riding Citi Bikes?"
    • Safety Concerns: Women often feel less safe navigating city traffic on bicycles, which can discourage them from using bike-sharing services.

    • Convenience Issues: The design of Citi Bikes and the docking stations may not accommodate women's clothing choices, such as skirts or dresses, making the service less convenient for them.

    • Maintenance and Cleanliness: Women may be more sensitive to the cleanliness and maintenance of shared bikes, and any deficiencies in these areas can be a deterrent.

    • Marketing and Outreach: The marketing strategies of bike-sharing programs may not effectively target women, leading to lower awareness and usage among this demographic.

  • Subscribers make up the majority of the overall user base

    • Convenience and Accessibility: Annual members have 24/7 access to thousands of bikes across multiple boroughs, providing a reliable transportation alternative to subways, taxis, and buses.
  • This also leads to another DISCUSSION: Whether Citi Bike wants more customers who purchase 24-hour pass or 7-day pass. According to the pricing plan, the more customers will bring more profit. On the other hand, this might lower the bike availability for subscribers who have the annual membership.

1.2.4 Trip Duration¶

In [ ]:
# Check the quantile and percentile of the trip duration
q1 = citibike_trip_sample_df['tripduration'].quantile(0.01)
q99 = citibike_trip_sample_df['tripduration'].quantile(0.99)
percentile_24hr = percentileofscore(citibike_trip_sample_df['tripduration'], 24*60*60, kind='rank') / 100

# Printing the information
print(f"1st Percentile (0.01 Quantile) of 'tripduration': {q1}")
print(f"99th Percentile (0.99 Quantile) of 'tripduration': {q99}")
print(f"Percentile rank of 24 hours (in seconds) for 'tripduration': {percentile_24hr:.4f}")
1st Percentile (0.01 Quantile) of 'tripduration': 119.0
99th Percentile (0.99 Quantile) of 'tripduration': 3879.0
Percentile rank of 24 hours (in seconds) for 'tripduration': 0.9998
In [19]:
plt.figure(figsize=(10, 6))
sns.histplot(citibike_trip_sample_df[citibike_trip_sample_df['tripduration'] <
                                     citibike_trip_sample_df['tripduration'].quantile(0.99)],
             x='tripduration', kde=True, color='#1f7ced', bins=30)
plt.title('Distribution of Trip Duration (Zoomed-In)', fontsize=16)
plt.xlabel('Trip Duration (seconds)', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.show()
# Save the figure as a PNG file
plt.savefig('image/distribution_of_trip_duration.png', dpi=300, bbox_inches='tight')
No description has been provided for this image
<Figure size 640x480 with 0 Axes>

Based on the analysis of tripduration data for CitiBike users:

  • Majority of Trips Are Under 1 Hour

    • The 1st percentile of trip duration is 119 seconds (just under 2 minutes), and the 99th percentile is 3,879 seconds (approximately 1 hour and 4 minutes). This indicates that the vast majority of trips are well within 1 hour.

    • This aligns with CitiBike’s pricing structure:

      • Single Ride Pass: The first 30 minutes are free, and additional minutes incur an extra fee.
      • Annual Membership: The first 45 minutes of each ride are included in the membership cost.

    The pricing model naturally incentivizes users to keep their trips shorter, resulting in most trips falling below 1 hour.

  • Trips Longer Than 24 Hours Are Extremely Rare

    • The percentile rank of trips lasting 24 hours (86,400 seconds) is 0.9998, highlighting that such trips are exceptionally uncommon.

    • When trips exceed 24 hours, it is likely due to:

      • Users forgetting to return the bike.
      • Situations where the bike cannot be returned due to technical or logistical issues.

    These extended trips should be treated as outliers in the data.

In [20]:
# Boxplot by Usertype
plt.figure(figsize=(12, 6))
sns.boxplot(data=citibike_trip_sample_df, x='usertype', y='tripduration', color='skyblue', palette='Set2')
plt.title('Boxplot of Trip Duration by Usertype', fontsize=16)
plt.xlabel('Usertype', fontsize=14)
plt.ylabel('Trip Duration (seconds)', fontsize=14)
plt.ylim(0, citibike_trip_sample_df['tripduration'].quantile(0.99))  # Focus on 99th percentile
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Save the figure as a PNG file
plt.savefig('image/boxplot_of_trip_duration_by_usertype.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image

Boxplot of Trip Duration by Usertype

  • Subscribers:

    • The median trip duration for subscribers is shorter compared to customers, falling well under 1,000 seconds (around 16 minutes).
    • The interquartile range (IQR) is compact, indicating a relatively consistent trip duration pattern among subscribers.
    • Outliers exist but are fewer compared to customers, suggesting that subscribers tend to return bikes promptly.
  • Customers:

    • Customers have a higher median trip duration than subscribers, and their IQR is wider, suggesting greater variability in trip durations.
    • The presence of more extreme outliers for customers indicates occasional prolonged trips, potentially due to less familiarity with CitiBike’s policies or less time pressure.

Conclusion: Subscribers, who are likely regular users, exhibit shorter and more predictable trip durations, whereas customers have longer and more variable trip durations, with some extreme outliers.

In [21]:
# Boxplot by Gender
plt.figure(figsize=(12, 6))
sns.boxplot(data=citibike_trip_sample_df, x='gender', y='tripduration', color='skyblue', palette='Set2')
plt.title('Boxplot of Trip Duration by Gender', fontsize=16)
plt.xlabel('Gender', fontsize=14)
plt.ylabel('Trip Duration (seconds)', fontsize=14)
plt.ylim(0, citibike_trip_sample_df['tripduration'].quantile(0.99))  # Focus on 99th percentile
plt.grid(axis='y', linestyle='--', alpha=0.7)
# Save the figure as a PNG file
plt.savefig('image/boxplot_of_trip_duration_by_gender.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image

Boxplot of Trip Duration by Gender

  • Male:

    • Males have the shortest median trip duration and the narrowest IQR among all gender categories, indicating highly consistent trip durations.
    • Few outliers suggest that male users tend to follow the typical usage pattern.
  • Female:

    • Females show a slightly higher median trip duration compared to males, with a slightly wider IQR.
    • Some outliers are present, indicating a small number of unusually long trips.
  • Unknown Gender:

    • Users with an unknown gender have the highest median trip duration and the widest IQR, indicating the most variable usage patterns.
    • This group also has the most significant presence of outliers, suggesting potential data anomalies or users with atypical trip behaviors.

Conclusion: Males exhibit the most consistent trip durations, while the "unknown" gender group has the most variability and the longest trips, including numerous outliers.

1.2.5 Geolocation Distribution of Citibike Station in Use Over Time¶

In [22]:
trip_start_hour_map_data = citibike_trip_sample_df.groupby(['start_station_name', 'start_station_latitude', 'start_station_longitude', 'start_hour']).size().reset_index(name='count')
trip_start_hour_map_data.columns = ['station_name', 'latitude', 'longitude', 'hour', 'count']
trip_end_hour_map_data = citibike_trip_sample_df.groupby(['end_station_name', 'end_station_latitude', 'end_station_longitude', 'stop_hour']).size().reset_index(name='count')
trip_end_hour_map_data.columns = ['station_name', 'latitude', 'longitude', 'hour', 'count']
trip_start_hour_map_data.shape, trip_end_hour_map_data.shape
Out[22]:
((11536, 5), (11607, 5))
In [23]:
trip_start_month_map_data = citibike_trip_sample_df.groupby(['start_station_name', 'start_station_latitude', 'start_station_longitude', 'start_month']).size().reset_index(name='count')
trip_start_month_map_data.columns = ['station_name', 'latitude', 'longitude', 'month', 'count']
trip_end_month_map_data = citibike_trip_sample_df.groupby(['end_station_name', 'end_station_latitude', 'end_station_longitude', 'stop_month']).size().reset_index(name='count')
trip_end_month_map_data.columns = ['station_name', 'latitude', 'longitude', 'month', 'count']
trip_start_month_map_data.shape, trip_end_month_map_data.shape
Out[23]:
((6052, 5), (6062, 5))
In [24]:
trip_start_hour_map_data.latitude.mean(), trip_start_hour_map_data.longitude.mean()
Out[24]:
(40.73075455494539, -73.98023185396151)
In [ ]:
# def draw_heatmap(df):
#   # Create a base map
#   m = folium.Map(location=[40.73, -73.98], zoom_start=12, width=800, height=800)
#   # Add heatmap to the map
#   heat_data = [[row['latitude'], row['longitude'], row['count']] for index, row in df.iterrows()]
#   HeatMap(heat_data, radius=15).add_to(m)
#   return m
In [ ]:
# def draw_heatmap_with_time(df, timeframe_col = 'hour'):
#     # Ensure 'hour' is sorted for proper time sequencing
#     df = df.sort_values(by=timeframe_col)

#     # Group data by hour and prepare heatmap data
#     heat_data = []
#     for hour, group in df.groupby(timeframe_col):
#         heat_data.append(
#             [[row['latitude'], row['longitude'], row['count']] for _, row in group.iterrows()]
#         )

#     # Create a base map
#     m = folium.Map(location=[40.73, -73.98], zoom_start=12, width=800, height=800)

#     # Add the HeatMapWithTime layer
#     HeatMapWithTime(
#         data=heat_data,
#         radius=20,
#         index=list(df[timeframe_col].unique()),  # Use unique 'hour' values as the time index
#         auto_play=True,  # Automatically play through frames
#         max_opacity=0.5,  # Reduce the opacity for a softer visual effect
#         scale_radius=False,# Keep the radius consistent
#         # scale_radius=True,
#         #  gradient={0.1: 'blue', 0.3: 'cyan', 0.5: 'green', 0.75: 'yellow', 1: 'red'}

#     ).add_to(m)

#     return m
In [ ]:
# display(draw_heatmap_with_time(trip_start_hour_map_data, 'hour'))
In [30]:
def draw_heatmap_with_time_plotly(df, title, timeframe_col='hour'):
    """
    Create an animated heatmap using Plotly to display usage over time with a constant scale.

    Parameters:
    - df: DataFrame with columns ['station_name', 'latitude', 'longitude', 'hour', 'count'].
    - timeframe_col: The column representing time (e.g., 'hour').

    Returns:
    - fig: A Plotly figure object.
    """
    # Sort by time for proper sequencing
    df = df.sort_values(by=timeframe_col)

    # Determine consistent size and color scales
    max_count = df['count'].max()

    # Create an animated scatter map
    fig = px.scatter_mapbox(
        df,
        lat='latitude',
        lon='longitude',
        size='count',
        color='count',
        animation_frame=timeframe_col,
        hover_name='station_name',
        size_max=15,  # Maximum size of the circle
        zoom=11,
        mapbox_style="carto-positron",
        title=title
    )

    # Set consistent size and color scales
    fig.update_traces(marker=dict(sizemin=3))  # Set a minimum circle size
    fig.update_layout(
        coloraxis_colorbar=dict(title="Usage Count"),
        coloraxis=dict(cmin=0, cmax=max_count),  # Constant color scale
        margin={"r": 0, "t": 50, "l": 0, "b": 0},
        width=900,  # Narrower width
        height=900  # Taller height
    )

    return fig
In [31]:
fig = draw_heatmap_with_time_plotly(trip_start_hour_map_data, 'Hourly Outflow of CitiBike Stations (Based on Start Stations)', 'hour')
fig.write_html("html/trip_start_hour_heatmap_animation.html")
fig.show()
In [32]:
fig = draw_heatmap_with_time_plotly(trip_end_hour_map_data, 'Hourly Inflow of CitiBike Stations (Based on End Stations)', 'hour')
fig.write_html("html/trip_stop_hour_heatmap_animation.html")
fig.show()
In [33]:
fig = draw_heatmap_with_time_plotly(trip_start_month_map_data, 'Monthly Outflow of CitiBike Stations (Based on Start Stations)', 'month')
fig.write_html("html/trip_start_month_heatmap_animation.html")
fig.show()
In [34]:
fig = draw_heatmap_with_time_plotly(trip_end_month_map_data, 'Monthly Inflow of CitiBike Stations (Based on End Stations)', 'month')
fig.write_html("html/trip_stop_month_heatmap_animation.html")
fig.show()

Station Locations From the hourly and monthly heatmap animations for inflow and outflow, it is evident that most CitiBike stations are concentrated in downtown and midtown areas.

Peak Usage Hours The hourly heatmap animations reveal that the busiest times for inflow and outflow occur during commuting hours: mornings (office commute) and evenings (returning from work). This aligns with expectations, as the majority of users are subscribers who primarily use CitiBike for daily commuting.

Seasonal Usage Patterns: The monthly heatmap animations indicate that CitiBike usage peaks during the summer months and reaches its lowest point in the winter. This pattern is consistent with seasonal behavior, as people ride less during cold winter months and prefer short, enjoyable trips during the warmer summer season.

1.2.6 Number of Trips By Month and By Day of Week¶

In [35]:
# Create a 'year_month' column
citibike_trip_sample_df['year_month'] = (
    citibike_trip_sample_df['start_year'].astype(str) + "-" +
    citibike_trip_sample_df['start_month'].astype(str).str.zfill(2)
)

# Group data by year_month and usertype
grouped_data = citibike_trip_sample_df.groupby(['year_month', 'usertype']).size().reset_index(name='num_trips')

# Pivot the data for plotting
pivot_data = grouped_data.pivot(index='year_month', columns='usertype', values='num_trips').fillna(0)

# Plot the data as separate bar plots for each user type
fig, ax = plt.subplots(figsize=(14, 8))
pivot_data.plot(kind='bar', ax=ax, color=['#1f77b4', '#ff7f0e'])

# Set plot details
ax.set_title('Number of Trips by Year-Month and User Type', fontsize=14)
ax.set_xlabel('Year-Month', fontsize=12)
ax.set_ylabel('Number of Trips', fontsize=12)
ax.legend(title='User Type', fontsize=10)
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Save the figure as a PNG file
plt.savefig('image/number_of_trips_by_year-month_and_usertype.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image
In [36]:
import calendar
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Map month numbers to abbreviated names
citibike_trip_sample_df['month_name'] = citibike_trip_sample_df['start_month'].map(
    lambda x: calendar.month_abbr[x]
)
monthly_trips = (
    citibike_trip_sample_df.groupby(['month_name', 'year_month', 'usertype'])
    .size()
    .reset_index(name='monthly_trips')
)

# Calculate the average number of trips per month for each year_month and usertype
avg_monthly_trips = (
    monthly_trips.groupby(['month_name', 'usertype'])['monthly_trips']
    .mean()
    .reset_index(name='avg_trips_per_month')
)

avg_monthly_trips['month_name'] = pd.Categorical(
    avg_monthly_trips['month_name'], categories=month_order, ordered=True
)
# Pivot the data for plotting
pivot_avg_monthly_trips = avg_monthly_trips.pivot(index='month_name', columns='usertype', values='avg_trips_per_month').fillna(0)

# Plot the average trips as a bar plot
fig, ax = plt.subplots(figsize=(14, 8))
pivot_avg_monthly_trips.plot(kind='bar', ax=ax, color=['#1f77b4', '#ff7f0e'])

# Customize the plot
ax.set_title('Average Number of Trips per Month by User Type', fontsize=14)
ax.set_xlabel('Month', fontsize=12)
ax.set_ylabel('Average Number of Trips per Month', fontsize=12)
ax.legend(title='User Type', fontsize=10)
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Save the figure as a PNG file
plt.savefig('image/average_number_of_trips_per_month_by_usertype.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image

Based on the two monthly bar charts, the following observations can be made:

  1. Seasonality:

    • Both subscribers and customers exhibit the same monthly seasonality, with higher usage during the warmer months and lower usage during the colder months. This indicates that seasonal factors influence CitiBike usage across all user types.
  2. Overall Trend:

    • There is an upward overall trend in CitiBike usage over time. Both subscriber and customer usage steadily increase, reflecting growing popularity and adoption of CitiBike as a mode of transportation.
  3. Implications for Forecasting:

    • These patterns of seasonality and overall growth will be key when performing time series forecasting. We will decompose the data into seasonal and trend components to better understand these factors and use them as the basis for predicting future CitiBike usage.
In [37]:
# Extract the day of the week
citibike_trip_sample_df['day_of_week'] = citibike_trip_sample_df['starttime'].dt.day_name()

# Extract the date part from 'starttime'
citibike_trip_sample_df['start_date'] = citibike_trip_sample_df['starttime'].dt.date
citibike_trip_sample_df['stop_date'] = citibike_trip_sample_df['stoptime'].dt.date

# Group by day_of_week, usertype, and start_date to calculate the number of trips per day
daily_trips = (
    citibike_trip_sample_df.groupby(['day_of_week', 'usertype', 'start_date'])
    .size()
    .reset_index(name='daily_trips')
)

# Calculate the average number of trips per day for each day_of_week and usertype
avg_trips = (
    daily_trips.groupby(['day_of_week', 'usertype'])['daily_trips']
    .mean()
    .reset_index(name='avg_trips_per_day')
)

# Ensure the days are in the correct order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
avg_trips['day_of_week'] = pd.Categorical(avg_trips['day_of_week'], categories=day_order, ordered=True)

# Pivot the data for plotting
pivot_avg_trips = avg_trips.pivot(index='day_of_week', columns='usertype', values='avg_trips_per_day').fillna(0)

# Plot the average trips as a bar plot
fig, ax = plt.subplots(figsize=(14, 8))
pivot_avg_trips.plot(kind='bar', ax=ax, color=['#1f77b4', '#ff7f0e'])

# Customize the plot
ax.set_title('Average Number of Trips per Day of Week by User Type', fontsize=14)
ax.set_xlabel('Day of Week', fontsize=12)
ax.set_ylabel('Average Number of Trips per Day', fontsize=12)
ax.legend(title='User Type', fontsize=10)
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
# Save the figure as a PNG file
plt.savefig('image/average_number_of_trips_per_dayofweek_by_usertype.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image

Based on the chart showing the average number of trips per day of the week for each user type:

  1. Subscribers:

    • Subscribers show a consistent usage pattern throughout the weekdays (Monday to Friday), with a slightly lower average on weekends (Saturday and Sunday).
    • This indicates that subscribers primarily use CitiBike for commuting purposes during workdays.
  2. Customers:

    • Customers have a contrasting usage pattern, with higher usage on weekends (Saturday and Sunday) compared to weekdays.
    • This suggests that customers, who are likely casual users or tourists, primarily use CitiBike for leisure or recreational activities during the weekend.

2. Citibike Usage Forecast¶

2.1 Remove Outliers¶

In [38]:
# Remove outliers
citibike_trip_sample_clean_df = citibike_trip_sample_df.copy()
print(citibike_trip_sample_df.shape)
citibike_trip_sample_clean_df = citibike_trip_sample_clean_df.query('starttime<stoptime')
print(citibike_trip_sample_clean_df.shape)
citibike_trip_sample_clean_df = citibike_trip_sample_clean_df.query('tripduration < 24*60*60')
print(citibike_trip_sample_clean_df.shape)
## Most customers don't put birth year when rent a bike. And this is not the most essential info when it comes to usage forecast
# citibike_trip_sample_clean_df = citibike_trip_sample_clean_df.query('user_age < 76')
# print(citibike_trip_sample_clean_df.shape)
(315785, 28)
(315784, 28)
(315712, 28)
In [39]:
citibike_trip_sample_df.head()
Out[39]:
tripduration starttime stoptime start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bikeid usertype birth_year gender customer_plan start_hour start_month start_year stop_hour stop_month stop_year user_age year_month month_name day_of_week start_date stop_date
0 2319 2016-03-09 13:08:00 2016-03-09 13:47:00 520 W 52 St & 5 Ave 40.759923 -73.976485 363 West Thames St 40.708347 -74.017134 23062 Subscriber 1972.0 male NaN 13 3 2016 13 3 2016 44.0 2016-03 Mar Wednesday 2016-03-09 2016-03-09
1 313 2015-07-09 15:42:00 2015-07-09 15:47:00 520 W 52 St & 5 Ave 40.759923 -73.976485 493 W 45 St & 6 Ave 40.756800 -73.982912 16909 Subscriber 1968.0 female NaN 15 7 2015 15 7 2015 47.0 2015-07 Jul Thursday 2015-07-09 2015-07-09
2 906 2016-01-11 18:32:00 2016-01-11 18:47:00 520 W 52 St & 5 Ave 40.759923 -73.976485 3162 W 78 St & Broadway 40.783400 -73.980931 15614 Subscriber 1961.0 male NaN 18 1 2016 18 1 2016 55.0 2016-01 Jan Monday 2016-01-11 2016-01-11
3 716 2013-10-30 11:53:00 2013-10-30 12:05:00 520 W 52 St & 5 Ave 40.759923 -73.976485 533 Broadway & W 39 St 40.752996 -73.987216 19280 Subscriber 1954.0 male NaN 11 10 2013 12 10 2013 59.0 2013-10 Oct Wednesday 2013-10-30 2013-10-30
4 312 2014-06-04 16:12:00 2014-06-04 16:17:00 520 W 52 St & 5 Ave 40.759923 -73.976485 519 E 42 St & Vanderbilt Ave 40.752416 -73.978370 16483 Subscriber 1963.0 male NaN 16 6 2014 16 6 2014 51.0 2014-06 Jun Wednesday 2014-06-04 2014-06-04

2.2 Citibike Daily Usage Forecast with All Usertypes¶

2.2.1 Time Series Decompostion and Forecast¶

1. Trend Component¶

  • Prophet models the trend (long-term increase or decrease) using a piecewise linear function:
    • The trend is modeled as a sequence of connected linear segments.
    • To ensure smooth transitions between these segments, change points are introduced where the slope can change.
    • Formula: $$ g(t) = \beta_0 + \beta_1 t + \sum_{i=1}^N s_i(t) \delta_i $$ Where:
      • $g(t)$: Trend function at time ( t ).
      • $\beta_0, \beta_1 $: Intercept and slope before the first change point.
      • $\delta_i $: Change in slope at change point ( i ).
      • $ s_i(t) $: Indicator function for whether ( t ) is after change point ( i ).

2. Seasonality Component¶

  • Prophet models seasonality using a Fourier series to capture repeating patterns (e.g., daily, weekly, or yearly cycles):

    • Formula: $$ s(t) = \sum_{n=1}^N \left( a_n \cos\left(\frac{2\pi n t}{P}\right) + b_n \sin\left(\frac{2\pi n t}{P}\right) \right) $$ Where:
      • $s(t)$: Seasonal component at time ( t ).
      • $P$: Period of the seasonality (e.g., 365.25 for yearly seasonality).
      • $ a_n, b_n $: Fourier coefficients estimated from the data.
      • $ N $: Number of Fourier terms (determines the flexibility of the seasonal model).
  • By using the Fourier series, Prophet can model complex, periodic patterns while maintaining computational efficiency.


3. Combined Model¶

  • The final model is a sum of the trend and seasonal components: $$ y(t) = g(t) + s(t) + \epsilon_t $$ Where:
    • $ y(t) $: Observed value at time ( t ).
    • $ g(t) $: Trend component.
    • $ s(t) $: Seasonal component.
    • $ \epsilon_t $: Residual noise/error.

Modeling Residuals and Prediction Intervals¶

After decomposing the time series into trend and seasonality, Prophet models the residuals (( \epsilon_t )) to account for uncertainty and generate prediction intervals.

  1. Residual Assumption:

    • The residuals are assumed to follow a normal distribution centered around zero: $$ \epsilon_t \sim N(0, \sigma^2) $$ Where:
      • $ \sigma^2 $: Variance of the residuals.
  2. Estimating Prediction Intervals:

    • Based on the residuals' variance, Prophet calculates the uncertainty in the forecasts by generating confidence intervals (e.g., 95% prediction intervals).
    • Formula for prediction interval: $$ \hat{y}(t) \pm z \cdot \hat{\sigma} $$ Where:
      • $ \hat{y}(t) $: Predicted value at time ( t ).
      • $ z $: Z-score corresponding to the desired confidence level (e.g., ( z = 1.96 ) for 95% confidence).
      • $ \hat{\sigma} $: Standard deviation of the residuals.
  3. Steps in Modeling Residuals:

    • Fit the model using the trend and seasonal components.
    • Calculate the residuals by subtracting the fitted values from the actual values.
    • Use the residuals to estimate their variance and construct prediction intervals.
  4. Implications for Forecasting:

    • These prediction intervals reflect the uncertainty in both the trend and seasonal components, providing realistic bounds for the forecasts.
    • They allow for more robust decision-making by accounting for variability in the data.

Key Takeaways¶

  1. Seasonality via Fourier Transformation:

    • Fourier series efficiently represent complex, periodic patterns without overfitting.
    • It allows for flexible modeling of both short-term (e.g., weekly) and long-term (e.g., yearly) seasonality.
  2. Trend via Piecewise Linear Function:

    • The trend is dynamic, adapting to changes in data behavior using change points.
    • This makes the model robust to sudden shifts in trend, such as those caused by external factors.
  3. Residuals and Prediction Intervals:

    • Residual modeling ensures accurate representation of uncertainty.
    • Prediction intervals provide a realistic range for future values, making the forecasts more actionable and reliable.
In [40]:
def usage_forecast(df, output_filename):
  daily_trips = df.groupby(df['starttime'].dt.to_period('D')).size()
  daily_trips.index = daily_trips.index.to_timestamp()  # Convert to DatetimeIndex


  daily_trips = daily_trips.reset_index()
  # Ensure time series has a continuous index (fill missing days with 0 trips)
  # daily_trips = daily_trips.asfreq('D', fill_value=0)
  daily_trips.columns = ['ds', 'y']  # Rename columns for Prophet compatibility

  # Split data for training and testing
  train = daily_trips[(daily_trips['ds'] >= '2013-07-01') & (daily_trips['ds'] <= '2016-06-30')]
  test = daily_trips[(daily_trips['ds'] >= '2016-07-01') & (daily_trips['ds'] <= '2016-08-30')]

  # Initialize and train the Prophet model
  model = Prophet(interval_width=0.95) # Set to 95% confidence interval (default value is 80%)
  model.add_seasonality(name='weekly', period=7, fourier_order=3)
  model.add_seasonality(name='yearly', period=365.25, fourier_order=5)
  model.fit(train)

  # Make predictions for the test period
  future = model.make_future_dataframe(periods=365, freq='D')
  forecast = model.predict(future)

  # Extract predictions for the test period
  forecast_test = forecast[forecast['ds'].isin(test['ds'])]

  # Calculate evaluation metrics
  y_true = test['y'].values
  y_pred = forecast_test['yhat'].values

  # Mean Absolute Error (MAE)
  mae = mean_absolute_error(y_true, y_pred)

  # Mean Absolute Percentage Error (MAPE)
  mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100

  # Root Mean Squared Error (RMSE)
  rmse = mean_squared_error(y_true, y_pred, squared=False)

  # Root Mean Squared Percentage Error (RMSPE)
  rmspe = np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)) * 100

  # Print metrics
  print(f"Mean Absolute Error (MAE): {mae:.2f}")
  print(f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%")
  print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
  print(f"Root Mean Squared Percentage Error (RMSPE): {rmspe:.2f}%")

 # Create side-by-side plots
  fig, axes = plt.subplots(1, 2, figsize=(18, 6))

  # Full Forecast Plot
  axes[0].set_title('Prophet Forecast: Daily Trips')
  fig_full = model.plot(forecast, ax=axes[0])
  axes[0].scatter(test['ds'], test['y'], color='orange', label='Test Data', zorder=3, s=10)  # Add test data points
  axes[0].set_xlabel('Date')
  axes[0].set_ylabel('Number of Trips')
  axes[0].legend()

  # Zoomed-In Forecast Plot
  axes[1].set_title('Prophet Forecast: Zoomed-In for Last Two Months')
  fig_zoom = model.plot(forecast, ax=axes[1])
  axes[1].plot(
      test['ds'], test['y'],
      color='orange',
      label='Test Data (Line)',
      zorder=2,
      linestyle='-',
      marker='o',
      linewidth=1  # Adjust line thickness
  )
  axes[1].set_xlim(pd.Timestamp('2016-07-01'), pd.Timestamp('2016-08-30'))
  axes[1].set_xlabel('Date')
  axes[1].set_ylabel('Number of Trips')
  axes[1].legend()

  # Adjust layout
  plt.tight_layout()
  plt.savefig(output_filename, dpi=300, bbox_inches='tight')
  plt.show()
In [41]:
usage_forecast(citibike_trip_sample_clean_df, "image/prediction_for_citibike_user_usage.png")
INFO:prophet:Found custom seasonality named 'yearly', disabling built-in 'yearly' seasonality.
INFO:prophet:Found custom seasonality named 'weekly', disabling built-in 'weekly' seasonality.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpk1t6jpc7/s5_49kqn.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpk1t6jpc7/2yp_0let.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=93518', 'data', 'file=/tmp/tmpk1t6jpc7/s5_49kqn.json', 'init=/tmp/tmpk1t6jpc7/2yp_0let.json', 'output', 'file=/tmp/tmpk1t6jpc7/prophet_modelbkfldulp/prophet_model-20241204204053.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
20:40:53 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
20:40:53 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
Mean Absolute Error (MAE): 50.96
Mean Absolute Percentage Error (MAPE): 12.67%
Root Mean Squared Error (RMSE): 67.11
Root Mean Squared Percentage Error (RMSPE): 19.09%
No description has been provided for this image
In [43]:
# Aggregate daily data
subscriber_citibike_trip_sample_clean_df = citibike_trip_sample_clean_df.query('usertype=="Subscriber"')
customer_citibike_trip_sample_clean_df = citibike_trip_sample_clean_df.query('usertype=="Customer"')
In [46]:
usage_forecast(subscriber_citibike_trip_sample_clean_df, "image/prediction_for_citibike_subscriber_usage.png")
INFO:prophet:Found custom seasonality named 'yearly', disabling built-in 'yearly' seasonality.
INFO:prophet:Found custom seasonality named 'weekly', disabling built-in 'weekly' seasonality.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpk1t6jpc7/17qe_fqt.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpk1t6jpc7/lst7g6hf.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=15292', 'data', 'file=/tmp/tmpk1t6jpc7/17qe_fqt.json', 'init=/tmp/tmpk1t6jpc7/lst7g6hf.json', 'output', 'file=/tmp/tmpk1t6jpc7/prophet_modeliiz6uylb/prophet_model-20241204215056.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
21:50:56 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
21:50:56 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
Mean Absolute Error (MAE): 53.27
Mean Absolute Percentage Error (MAPE): 17.06%
Root Mean Squared Error (RMSE): 69.65
Root Mean Squared Percentage Error (RMSPE): 26.53%
No description has been provided for this image
In [45]:
usage_forecast(customer_citibike_trip_sample_clean_df, "image/prediction_for_citibike_customer_usage.png")
INFO:prophet:Found custom seasonality named 'yearly', disabling built-in 'yearly' seasonality.
INFO:prophet:Found custom seasonality named 'weekly', disabling built-in 'weekly' seasonality.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpk1t6jpc7/y7vd3_j4.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpk1t6jpc7/hd70wvgm.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=94701', 'data', 'file=/tmp/tmpk1t6jpc7/y7vd3_j4.json', 'init=/tmp/tmpk1t6jpc7/hd70wvgm.json', 'output', 'file=/tmp/tmpk1t6jpc7/prophet_model3qmx3aza/prophet_model-20241204204402.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
20:44:02 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
20:44:02 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
Mean Absolute Error (MAE): 17.33
Mean Absolute Percentage Error (MAPE): 28.24%
Root Mean Squared Error (RMSE): 22.23
Root Mean Squared Percentage Error (RMSPE): 36.40%
No description has been provided for this image

2.2.2 Interpretation and Comparison of Prediction Performance Metrics¶

1. General Observations¶

  • User Group (Overall):

    • MAE: 50.96, MAPE: 12.67%, RMSE: 67.11, RMSPE: 19.09%
    • The overall metrics for the user group indicate relatively good prediction accuracy, with a lower MAPE and RMSPE compared to the subscriber and customer subgroups. This suggests that, on average, the model is able to predict the total number of trips with reasonable accuracy.
  • Subscriber Group:

    • MAE: 53.27, MAPE: 17.06%, RMSE: 69.65, RMSPE: 26.53%
    • Subscribers have a higher MAPE and RMSPE compared to the overall user group. This suggests that there is greater relative error in predicting subscriber trips, likely due to more variability in their behavior across different days.
  • Customer Group:

    • MAE: 17.33, MAPE: 28.24%, RMSE: 22.23, RMSPE: 36.40%
    • Customers have the lowest MAE and RMSE in absolute terms but the highest MAPE and RMSPE. This indicates that while the absolute errors are smaller (due to smaller trip counts), the relative error is significantly higher. Customer behavior may be harder to predict due to their weekend-dominant usage pattern and other unpredictable factors (e.g., tourism or leisure).

2. Why is MAPE and RMSPE Lower for the Overall User Group?¶

  • Cancellation of Outliers:
    • The lower MAPE and RMSPE for the total user group can be explained by the offsetting effect of outliers in the subscriber and customer groups:
      • On weekdays, subscribers dominate the trip counts, while on weekends, customers dominate.
      • The combined totals for the overall user group smooth out these fluctuations and outliers, resulting in lower percentage errors for the aggregated data.

3. Potential Improvements¶

  • Adding External Variables:
    • To further explain the variance in residuals, we can incorporate external factors into the model, such as:
      • Weather Data: Temperature, precipitation, wind speed, etc., as weather conditions directly impact bike usage.
      • Holidays: National or local holidays may affect the behavior of both subscribers and customers (e.g., fewer commuters, more tourists).
      • Marketing Events: Promotions or campaigns by CitiBike could lead to spikes in usage.
    • Including these variables can help explain patterns and reduce the unexplained variance in the residuals, leading to more accurate predictions.

4. Understanding Variance in Residuals¶

  • Variance in Residuals:
    • Residual variance reflects the unexplained information in the data, which cannot be captured by the current model structure or features.
    • High variance suggests the presence of hidden factors or complexity in the data that the model has not accounted for.
    • By minimizing this variance (e.g., adding explanatory variables), the model can improve its predictive performance.

Key Takeaways¶

  1. Performance by Group:

    • The overall user group has lower relative errors (MAPE and RMSPE) due to the smoothing effect of subscriber and customer fluctuations.
    • Subscribers are relatively more predictable but still exhibit moderate percentage errors.
    • Customers are harder to predict, with significantly higher percentage errors due to their variability.
  2. Potential for Improvement:

    • Incorporating external factors (e.g., weather, holidays, marketing events) can help explain residual variance and improve prediction accuracy.
  3. Interpreting Variance:

    • Residual variance represents hidden or unexplained information in the data. Reducing this variance requires enhancing the model with additional explanatory variables or structural changes.

3. Bike Station Insights¶

3.1 Find Out the Top Used Bike Stations¶

In [48]:
df = citibike_trip_sample_clean_df.copy()

# Calculate the usage for each station
start_usage = df.groupby(['start_station_id', 'start_station_name', 'start_station_latitude', 'start_station_longitude']).size().reset_index(name='start_count')
end_usage = df.groupby(['end_station_id', 'end_station_name', 'end_station_latitude', 'end_station_longitude']).size().reset_index(name='end_count')

# Merge start and end usage to calculate total usage for each station
station_usage = pd.merge(start_usage, end_usage,
                         left_on='start_station_id', right_on='end_station_id',
                         how='outer')

# Fill NaN values with 0 and calculate total usage
station_usage.fillna({'start_count': 0, 'end_count': 0}, inplace=True)
station_usage['total_usage'] = station_usage['start_count'] + station_usage['end_count']

# Merge latitude and longitude information
station_usage['latitude'] = station_usage['start_station_latitude'].where(
    ~station_usage['start_station_latitude'].isnull(), station_usage['end_station_latitude'])
station_usage['longitude'] = station_usage['start_station_longitude'].where(
    ~station_usage['start_station_longitude'].isnull(), station_usage['end_station_longitude'])

# Use `start_station_name` if available; otherwise, fall back to `end_station_name`
station_usage['station_name'] = station_usage['start_station_name'].where(
    station_usage['start_station_name'] != '', station_usage['end_station_name'])
In [49]:
# Sort by total usage and get the top 10 most used stations
top_stations = station_usage[['station_name', 'total_usage']].sort_values(by='total_usage', ascending=True).tail(10)

# Draw the horizontal bar chart
plt.figure(figsize=(10, 6))
plt.barh(top_stations['station_name'], top_stations['total_usage'])
plt.title('Top 10 Most Used Stations')
plt.xlabel('Total Usage')
plt.ylabel('Station Name')
plt.tight_layout()
# Save the figure as a PNG file
plt.savefig('image/top_10_most_used_stations.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image
In [50]:
import plotly.express as px

# Get the top 100 most used stations
top_100_stations = station_usage[['station_name', 'latitude', 'longitude', 'total_usage']].sort_values(by='total_usage', ascending=False).head(100)

# Plot the top 100 stations on the map using Plotly
fig = px.scatter_mapbox(
    top_100_stations,
    lat="latitude",
    lon="longitude",
    size="total_usage",
    color="total_usage",
    hover_name="station_name",
    size_max=15,
    zoom=12,
    mapbox_style="carto-positron",
    title="Top 100 Most Used Citibike Stations"
)
fig.update_layout(
        width=900,  # Narrower width
        height=900  # Taller height
    )
fig.write_html("html/top_100_most_used_stations.html")
fig.show()

3.2 Citibike Station Usage Pattern Clustering¶

3.2.1 Kmean Clustering¶

K-Means is an unsupervised machine learning algorithm used for clustering data into ( k ) groups based on their similarities. It minimizes the variance within each cluster by iteratively adjusting the cluster centroids and assigning data points to the nearest centroid.

  • Objective: Minimize the sum of squared distances (inertia) between data points and their corresponding cluster centroids: $$ W(C) = \sum_{k=1}^K \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 $$ Where:

    • $K$: Number of clusters.
    • $C_k$: Cluster $k$.
    • $\mu_k$: Centroid of cluster $k$.
    • $x_i$: Data point in cluster $k$.
  • Steps:

    1. Initialize $K$ cluster centroids randomly.
    2. Assign each data point to the nearest centroid.
    3. Update centroids by computing the mean of all points in each cluster.
    4. Repeat steps 2-3 until convergence (e.g., centroids stop changing significantly).
In [54]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Add bike_in and bike_out columns
df['bike_out'] = -1  # Bikes leaving the station
df['bike_in'] = 1    # Bikes arriving at the station

# Calculate bike outflows (grouped by start hour)
outflow = df.groupby(['start_station_id', 'start_hour'])['bike_out'].sum().reset_index()

# Calculate bike inflows (grouped by stop hour)
inflow = df.groupby(['end_station_id', 'stop_hour'])['bike_in'].sum().reset_index()

# Rename columns for consistency
outflow.rename(columns={'start_station_id': 'station_id', 'start_hour': 'hour'}, inplace=True)
inflow.rename(columns={'end_station_id': 'station_id', 'stop_hour': 'hour'}, inplace=True)

# Combine inflow and outflow into a single DataFrame
bike_flow = pd.concat([outflow, inflow], axis=0, ignore_index=True)

# Group by station and hour to calculate net bike flow (inflow - outflow)
bike_flow = bike_flow.groupby(['station_id', 'hour']).sum().reset_index()
bike_flow['net_flow'] = bike_flow['bike_in'] + bike_flow['bike_out']

# Prepare data for clustering: pivot to have stations as rows and net flow as columns
time_series_matrix = bike_flow.pivot_table(
    index='station_id',
    columns='hour',
    values='net_flow',
    fill_value=0
)

# Standardize the data for clustering
scaler = StandardScaler()
time_series_scaled = scaler.fit_transform(time_series_matrix)
In [55]:
bike_flow.head()
Out[55]:
station_id hour bike_out bike_in net_flow
0 116 0 -15.0 16.0 1.0
1 116 1 -10.0 14.0 4.0
2 116 2 -9.0 12.0 3.0
3 116 3 -3.0 3.0 0.0
4 116 4 -7.0 3.0 -4.0
In [56]:
time_series_matrix
Out[56]:
hour 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
station_id
116 1.0 4.0 3.0 0.0 -4.0 -7.0 6.0 7.0 2.0 39.0 10.0 1.0 9.0 12.0 -22.0 -9.0 -32.0 -30.0 -7.0 -8.0 3.0 20.0 17.0 -1.0
119 2.0 -1.0 1.0 0.0 0.0 0.0 -6.0 -3.0 0.0 -5.0 -5.0 -3.0 -3.0 0.0 1.0 1.0 1.0 3.0 9.0 -5.0 3.0 5.0 1.0 3.0
120 2.0 2.0 0.0 1.0 -1.0 -2.0 -4.0 -11.0 -26.0 -19.0 -3.0 -3.0 -3.0 -5.0 5.0 0.0 -2.0 -3.0 7.0 10.0 13.0 3.0 7.0 6.0
127 -8.0 0.0 1.0 2.0 2.0 -7.0 -10.0 -33.0 -31.0 65.0 4.0 -28.0 -8.0 -13.0 11.0 -4.0 -1.0 27.0 13.0 16.0 -6.0 3.0 9.0 -4.0
128 5.0 0.0 -6.0 1.0 3.0 -9.0 0.0 14.0 58.0 33.0 4.0 -3.0 8.0 13.0 -27.0 9.0 -7.0 -22.0 -37.0 -31.0 22.0 10.0 9.0 5.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
546 18.0 3.0 2.0 -2.0 1.0 20.0 -11.0 7.0 40.0 49.0 9.0 -13.0 -9.0 6.0 -7.0 -19.0 -41.0 -56.0 -36.0 1.0 13.0 6.0 2.0 -3.0
72 3.0 -2.0 3.0 0.0 2.0 8.0 1.0 0.0 -72.0 -5.0 -7.0 -9.0 34.0 -6.0 -11.0 -7.0 2.0 6.0 17.0 -4.0 16.0 26.0 6.0 7.0
79 -3.0 1.0 -2.0 -2.0 3.0 1.0 -1.0 -1.0 34.0 57.0 35.0 11.0 -8.0 18.0 -6.0 7.0 -9.0 -20.0 -29.0 -4.0 -7.0 4.0 -8.0 -7.0
82 8.0 2.0 1.0 1.0 3.0 1.0 -5.0 -5.0 -18.0 -5.0 -10.0 0.0 4.0 -1.0 12.0 -2.0 -12.0 3.0 18.0 5.0 11.0 7.0 -2.0 11.0
83 -5.0 1.0 0.0 1.0 0.0 0.0 -8.0 -4.0 6.0 -6.0 -1.0 -2.0 0.0 4.0 8.0 -7.0 14.0 3.0 13.0 8.0 2.0 -2.0 -13.0 -3.0

592 rows × 24 columns

In [57]:
# Determine optimal number of clusters using silhouette analysis
silhouette_scores = []
k_range = range(2, 10)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    cluster_labels = kmeans.fit_predict(time_series_scaled)
    score = silhouette_score(time_series_scaled, cluster_labels)
    silhouette_scores.append(score)

# Plot the silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(k_range, silhouette_scores, marker='o')
plt.title('Silhouette Scores for K-Means Clustering')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()

silhouette_score_optimal_k = k_range[np.argmax(silhouette_scores)]
print(f"Optimal number of clusters based on silhouette score: {silhouette_score_optimal_k}")
No description has been provided for this image
Optimal number of clusters based on silhouette score: 2
In [58]:
def elbow_method(data, k_range):
    """
    Determine the optimal number of clusters using the elbow method.

    Parameters:
    - data: The dataset to cluster (e.g., scaled time-series matrix).
    - k_range: A range of possible cluster numbers to evaluate.

    Returns:
    - inertia_scores: A list of inertia scores for each value of k.
    """
    inertia_scores = []

    # Compute KMeans for each value of k
    for k in k_range:
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(data)
        inertia_scores.append(kmeans.inertia_)  # Append the inertia (sum of squared distances)

    # Plot the inertia scores against k
    plt.figure(figsize=(10, 6))
    plt.plot(k_range, inertia_scores, marker='o')
    plt.title('Elbow Method to Determine Optimal k')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Inertia (Sum of Squared Distances)')
    plt.grid()
    plt.show()

    return inertia_scores

# Example usage
k_range = range(1, 11)  # Test k from 1 to 10
inertia_scores = elbow_method(time_series_scaled, k_range)
No description has been provided for this image
In [66]:
# Choose optimal number of clusters (based on silhouette score or elbow method)

elbow_method_optimal_k = 3
print(f"Optimal number of clusters based on elbow method: {elbow_method_optimal_k}")
Optimal number of clusters based on elbow method: 3

3.2.2 Cluster Selection Methods and Justification for ( k = 3 )¶

1. Elbow Method The elbow method helps determine the optimal number of clusters (( k )) by plotting the inertia (sum of squared distances between data points and their nearest cluster center) against the number of clusters. The goal is to identify the "elbow point" where adding more clusters results in a marginal decrease in inertia.

  • Formula for Inertia: $$ W(C) = \sum_{k=1}^K \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 $$ Where:
    • $K$: Number of clusters.
    • $C_k$: Cluster $k$.
    • $\mu_k$: Centroid of cluster $k$.
    • $x_i$: Data point in cluster $k$.

The "elbow" represents a trade-off between reducing within-cluster variance and keeping the model simple.


2. Silhouette Method¶

The silhouette method measures the quality of clustering by evaluating how well each data point fits within its cluster compared to the nearest neighboring cluster. The silhouette score ranges from $-1$ to $1$, where higher values indicate better-defined clusters.

  • Formula for Silhouette Score: $$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} $$ Where:
    • $a(i)$: Average distance of point $i$ to other points in the same cluster.
    • $b(i)$: Average distance of point $i$ to points in the nearest neighboring cluster.

A higher silhouette score suggests that clusters are well-separated and internally cohesive.


3. Why Choose ( k = 3 )?¶

While the silhouette method suggests ( k = 2 ) as the optimal number of clusters, I ultimately chose ( k = 3 ) for the following reasons:

  1. Elbow Method Suggests Larger ( k ):

    • The elbow point in the inertia plot is not clear-cut but occurs around ( k = 3 ) or slightly higher, indicating that ( k = 3 ) balances simplicity and improved clustering performance.
  2. Representative Usage Patterns:

    • When plotting the typical usage patterns for each cluster, using ( k = 3 ) yields clusters that are more representative and interpretable.
    • With ( k = 3 ), each cluster captures distinct and meaningful behaviors, making the clustering more actionable.

Conclusion¶

Choosing ( k = 3 ) allows for a balance between the mathematical recommendations from the silhouette and elbow methods and the practical interpretability of the clusters. This ensures that the clustering results are not only technically sound but also meaningful for understanding usage patterns.

In [62]:
# Perform clustering with the optimal number of clusters
kmeans = KMeans(n_clusters=elbow_method_optimal_k, random_state=42)
station_clusters = kmeans.fit_predict(time_series_scaled)

# Add cluster labels to stations
clustered_stations = pd.DataFrame({
    'station_id': time_series_matrix.index,
    'cluster': station_clusters
})
Optimal number of clusters based on elbow method: 3
In [63]:
clustered_stations.cluster.value_counts()
Out[63]:
count
cluster
2 435
0 91
1 66

In [65]:
# Add cluster labels to the scaled data for visualization
clustered_data = pd.DataFrame(time_series_scaled, index=time_series_matrix.index)
clustered_data['cluster'] = station_clusters

# Plot the average pattern and all others for each cluster
for cluster_id in sorted(clustered_data['cluster'].unique()):
    cluster_members = clustered_data[clustered_data['cluster'] == cluster_id].drop(columns=['cluster'])

    # Calculate the average pattern for the cluster
    cluster_average = cluster_members.mean(axis=0)

    plt.figure(figsize=(10, 6))

    # Plot all members in grey
    for _, member in cluster_members.iterrows():
        plt.plot(member.values, color='grey', alpha=0.3)

    # Plot the average in red
    plt.plot(cluster_average.values, color='red', label='Cluster Average', linewidth=2)

    # Add labels and title
    plt.title(f'Typical Shape for Cluster {cluster_id}')
    plt.xlabel('Hour')
    plt.ylabel('Scaled Net Bike Flow')
    plt.legend()
    plt.tight_layout()
    # Save the figure as a PNG file
    plt.savefig('image/typical_shape_for_cluster_{}.png'.format(cluster_id), dpi=300, bbox_inches='tight')
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

3.2.3 Interpretation of the Citybike Station Usage Patterns in Each Cluster¶

Based on the analysis of the typical usage patterns for the three clusters of CitiBike stations:

Cluster 0

  • Description:
    • This cluster likely represents stations located near office areas.
  • Pattern:
    • In the morning hours, there is a strong inflow of bikes as people ride bikes to work.
    • In the afternoon/evening, there is a strong outflow as people take bikes out to return home.
  • Interpretation:
    • These stations are characterized by their role as destinations during the morning commute and sources during the evening commute.

Cluster 1

  • Description:
    • This cluster likely represents stations near residential areas or subway hubs.
  • Pattern:
    • In the morning, there is a strong outflow of bikes as people take bikes to commute to work or connect to public transport.
    • In the afternoon/evening, there is a strong inflow as people return bikes after finishing their day.
  • Interpretation:
    • These stations act as sources in the morning and destinations in the evening, complementing the usage pattern of Cluster 0.

Cluster 2

  • Description:
    • This cluster represents stations with balanced inflow and outflow throughout the day.
  • Pattern:
    • The net flow remains close to zero across all hours, indicating that the inflow and outflow of bikes are almost equal at any time.
  • Interpretation:
    • These stations may be located in areas with a mix of residential, office, and recreational spaces or serve as intermediate transit points.

Others¶

In [67]:
!jupyter nbconvert --to html Citibike_Case_Study.ipynb
[NbConvertApp] WARNING | pattern 'Citibike_Case_Study.ipynb' matched no files
This application is used to convert notebook files (*.ipynb)
        to various other formats.

        WARNING: THE COMMANDLINE INTERFACE MAY CHANGE IN FUTURE RELEASES.

Options
=======
The options below are convenience aliases to configurable class-options,
as listed in the "Equivalent to" description-line of the aliases.
To see all configurable class-options for some <cmd>, use:
    <cmd> --help-all

--debug
    set log level to logging.DEBUG (maximize logging output)
    Equivalent to: [--Application.log_level=10]
--show-config
    Show the application's configuration (human-readable format)
    Equivalent to: [--Application.show_config=True]
--show-config-json
    Show the application's configuration (json format)
    Equivalent to: [--Application.show_config_json=True]
--generate-config
    generate default config file
    Equivalent to: [--JupyterApp.generate_config=True]
-y
    Answer yes to any questions instead of prompting.
    Equivalent to: [--JupyterApp.answer_yes=True]
--execute
    Execute the notebook prior to export.
    Equivalent to: [--ExecutePreprocessor.enabled=True]
--allow-errors
    Continue notebook execution even if one of the cells throws an error and include the error message in the cell output (the default behaviour is to abort conversion). This flag is only relevant if '--execute' was specified, too.
    Equivalent to: [--ExecutePreprocessor.allow_errors=True]
--stdin
    read a single notebook file from stdin. Write the resulting notebook with default basename 'notebook.*'
    Equivalent to: [--NbConvertApp.from_stdin=True]
--stdout
    Write notebook output to stdout instead of files.
    Equivalent to: [--NbConvertApp.writer_class=StdoutWriter]
--inplace
    Run nbconvert in place, overwriting the existing notebook (only
            relevant when converting to notebook format)
    Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory=]
--clear-output
    Clear output of current file and save in place,
            overwriting the existing notebook.
    Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory= --ClearOutputPreprocessor.enabled=True]
--coalesce-streams
    Coalesce consecutive stdout and stderr outputs into one stream (within each cell).
    Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory= --CoalesceStreamsPreprocessor.enabled=True]
--no-prompt
    Exclude input and output prompts from converted document.
    Equivalent to: [--TemplateExporter.exclude_input_prompt=True --TemplateExporter.exclude_output_prompt=True]
--no-input
    Exclude input cells and output prompts from converted document.
            This mode is ideal for generating code-free reports.
    Equivalent to: [--TemplateExporter.exclude_output_prompt=True --TemplateExporter.exclude_input=True --TemplateExporter.exclude_input_prompt=True]
--allow-chromium-download
    Whether to allow downloading chromium if no suitable version is found on the system.
    Equivalent to: [--WebPDFExporter.allow_chromium_download=True]
--disable-chromium-sandbox
    Disable chromium security sandbox when converting to PDF..
    Equivalent to: [--WebPDFExporter.disable_sandbox=True]
--show-input
    Shows code input. This flag is only useful for dejavu users.
    Equivalent to: [--TemplateExporter.exclude_input=False]
--embed-images
    Embed the images as base64 dataurls in the output. This flag is only useful for the HTML/WebPDF/Slides exports.
    Equivalent to: [--HTMLExporter.embed_images=True]
--sanitize-html
    Whether the HTML in Markdown cells and cell outputs should be sanitized..
    Equivalent to: [--HTMLExporter.sanitize_html=True]
--log-level=<Enum>
    Set the log level by value or name.
    Choices: any of [0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL']
    Default: 30
    Equivalent to: [--Application.log_level]
--config=<Unicode>
    Full path of a config file.
    Default: ''
    Equivalent to: [--JupyterApp.config_file]
--to=<Unicode>
    The export format to be used, either one of the built-in formats
            ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'qtpdf', 'qtpng', 'rst', 'script', 'slides', 'webpdf']
            or a dotted object name that represents the import path for an
            ``Exporter`` class
    Default: ''
    Equivalent to: [--NbConvertApp.export_format]
--template=<Unicode>
    Name of the template to use
    Default: ''
    Equivalent to: [--TemplateExporter.template_name]
--template-file=<Unicode>
    Name of the template file to use
    Default: None
    Equivalent to: [--TemplateExporter.template_file]
--theme=<Unicode>
    Template specific theme(e.g. the name of a JupyterLab CSS theme distributed
    as prebuilt extension for the lab template)
    Default: 'light'
    Equivalent to: [--HTMLExporter.theme]
--sanitize_html=<Bool>
    Whether the HTML in Markdown cells and cell outputs should be sanitized.This
    should be set to True by nbviewer or similar tools.
    Default: False
    Equivalent to: [--HTMLExporter.sanitize_html]
--writer=<DottedObjectName>
    Writer class used to write the
                                        results of the conversion
    Default: 'FilesWriter'
    Equivalent to: [--NbConvertApp.writer_class]
--post=<DottedOrNone>
    PostProcessor class used to write the
                                        results of the conversion
    Default: ''
    Equivalent to: [--NbConvertApp.postprocessor_class]
--output=<Unicode>
    Overwrite base name use for output files.
                Supports pattern replacements '{notebook_name}'.
    Default: '{notebook_name}'
    Equivalent to: [--NbConvertApp.output_base]
--output-dir=<Unicode>
    Directory to write output(s) to. Defaults
                                  to output to the directory of each notebook. To recover
                                  previous default behaviour (outputting to the current
                                  working directory) use . as the flag value.
    Default: ''
    Equivalent to: [--FilesWriter.build_directory]
--reveal-prefix=<Unicode>
    The URL prefix for reveal.js (version 3.x).
            This defaults to the reveal CDN, but can be any url pointing to a copy
            of reveal.js.
            For speaker notes to work, this must be a relative path to a local
            copy of reveal.js: e.g., "reveal.js".
            If a relative path is given, it must be a subdirectory of the
            current directory (from which the server is run).
            See the usage documentation
            (https://nbconvert.readthedocs.io/en/latest/usage.html#reveal-js-html-slideshow)
            for more details.
    Default: ''
    Equivalent to: [--SlidesExporter.reveal_url_prefix]
--nbformat=<Enum>
    The nbformat version to write.
            Use this to downgrade notebooks.
    Choices: any of [1, 2, 3, 4]
    Default: 4
    Equivalent to: [--NotebookExporter.nbformat_version]

Examples
--------

    The simplest way to use nbconvert is

            > jupyter nbconvert mynotebook.ipynb --to html

            Options include ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'qtpdf', 'qtpng', 'rst', 'script', 'slides', 'webpdf'].

            > jupyter nbconvert --to latex mynotebook.ipynb

            Both HTML and LaTeX support multiple output templates. LaTeX includes
            'base', 'article' and 'report'.  HTML includes 'basic', 'lab' and
            'classic'. You can specify the flavor of the format used.

            > jupyter nbconvert --to html --template lab mynotebook.ipynb

            You can also pipe the output to stdout, rather than a file

            > jupyter nbconvert mynotebook.ipynb --stdout

            PDF is generated via latex

            > jupyter nbconvert mynotebook.ipynb --to pdf

            You can get (and serve) a Reveal.js-powered slideshow

            > jupyter nbconvert myslides.ipynb --to slides --post serve

            Multiple notebooks can be given at the command line in a couple of
            different ways:

            > jupyter nbconvert notebook*.ipynb
            > jupyter nbconvert notebook1.ipynb notebook2.ipynb

            or you can specify the notebooks list in a config file, containing::

                c.NbConvertApp.notebooks = ["my_notebook.ipynb"]

            > jupyter nbconvert --config mycfg.py

To see all available configurables, use `--help-all`.

In [ ]:
!jupyter nbconvert --to html --no-input Citibike_Case_Study.ipynb
[NbConvertApp] Converting notebook Citibike_Case_Study.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 11 image(s).
[NbConvertApp] Writing 3183505 bytes to Citibike_Case_Study.html
In [ ]: